Date: November 19, 2025
Status: 🔴 CRITICAL ISSUES IDENTIFIED
Comprehensive audit of ACM analytics pipeline reveals multiple critical issues with failure predictions, missing outputs, and inconsistent thresholds. While enhanced forecasting is partially integrated, many analytics outputs are calculated but not written to database.
UPDATE (Nov 19, 2025): Critical issues #1-4 have been RESOLVED. See resolution details below.
ALL failure probabilities were identical: 0.4234 (42.34%)
-- Every single forecast has the same value!
Timestamp FailureProb Method
2023-10-21 00:00:00 0.4233724705 AR1_Health
2023-10-20 23:30:00 0.4233724705 AR1_Health
2023-10-20 23:00:00 0.4233724705 AR1_Health
... (all 48 rows identical)
The failure probability calculation in forecast.py or output_manager is broken. It's outputting a static value instead of dynamic predictions based on health forecast trajectory.
Failure probability should:
| Timestamp | HealthForecast | FailureProb (Current) | FailureProb (Expected) |
|---|---|---|---|
| 2023-10-20 00:00 | 87.84 | 42.3% | ~0-5% (healthy) |
| 2023-10-19 23:30 | 68.63 | 42.3% | ~60-70% (near threshold) |
| 2023-10-19 22:00 | 45.21 | 42.3% | ~85-95% (deep alert) |
Fixed: Removed np.maximum.accumulate() from failure probability calculation in core/rul_estimator.py (line 361)
Root Cause: The accumulate function was forcing monotonic increasing probabilities, causing all values to converge to the same maximum.
Validation: After fix, failure probabilities now vary correctly:
Files Changed:
core/rul_estimator.py: Removed accumulate, added explanatory commentenhanced_forecasting_sql.py exists and is called ✓ACM_EnhancedFailureProbability_TS table exists ✓enhanced_rul_estimator.py) is NEVER CALLED# File: core/enhanced_rul_estimator.py
# Status: EXISTS but UNUSED in pipeline
class EnhancedRULEstimator:
"""
Multi-model RUL estimation with uncertainty quantification.
- Combines degradation models
- Provides confidence intervals
- Sensor-level failure attribution
"""
Fixed: Integrated enhanced RUL estimator with config-based activation
Implementation:
np.maximum.accumulate bug in core/enhanced_rul_estimator.py (same issue as standard RUL)cfg.rul.use_enhanced config flag in configs/config_table.csv (default: False)core/acm_main.py to conditionally import and use enhanced estimatorUsage:
# Set in config to enable:
cfg.rul.use_enhanced = True
# Log confirmation:
"[RUL] Using enhanced RUL estimator (adaptive learning enabled)"
Enhanced Features (when enabled):
Files Changed:
core/enhanced_rul_estimator.py: Bug fix + explanatory commentcore/acm_main.py: Conditional import and usage logicconfigs/config_table.csv: Added rul.use_enhanced = FalseAsset health used "WATCH" (70-85) but sensors used "WARN" (z=2-3)
Different terminology for the same concept causes confusion:
| Metric | Thresholds | Terminology |
|---|---|---|
| Asset Health Index | 85+ / 70-85 / <70 | HEALTHY / CAUTION / ALERT |
| Sensor Z-Scores | <2 / 2-3 / 3+ | NORMAL / WARN / ALERT |
| Dashboard Labels | - | "Watch Condition" displayed |
Asset Health: 68.6 → Zone: ALERT (because < 70)
Sensor DEMO.SIM.06T33-1: z=2.09 → Level: WARN (because 2-3)
Dashboard shows: "Watch Condition"
Question: Should we say WATCH or CAUTION or WARN? Pick ONE term across all outputs.
Fixed: Standardized terminology to CAUTION across all outputs
Changes Applied:
output_manager.py:
HEALTH_WATCH_THRESHOLD → HEALTH_CAUTION_THRESHOLDanomaly_level() to return "CAUTION" instead of "WARN"SEVERITY_COLORS mapping to include CAUTION (#f59e0b)grafana_dashboards/asset_health_dashboard.json:
Standardized Zones:
Validation: Grep searches confirmed all user-facing instances updated
SELECT RUL_Hours FROM ACM_RUL_Summary WHERE EquipID=1
-- Result: 24.0 hours (always!)
RUL calculation appears to be:
Progress: Enhanced RUL estimator integrated but not yet tested
Completed:
cfg.rul.use_enhancedPending:
cfg.rul.use_enhanced = trueNext Step: Enable enhanced RUL in config and run batch test to validate
SELECT omr_z FROM ACM_Scores_Wide WHERE EquipID=1
-- Shows: 10.0, 10.0, 10.0... (suspicious - all same value!)
This suggests OMR may be saturating or not calculating properly.
ACM_OMR_Metrics (reconstruction errors, explained variance)ACM_OMR_SensorContributions (which sensors drive residuals)ACM_OMR_Timeline (OMR score time series)SELECT * FROM ACM_PCA_Metrics WHERE EquipID=1
-- Result: 0 rows
core/correlation.py::PCASubspaceDetector has .pca attribute with all this data, but it was not being written to the database.
Fixed: Added PCA metrics output functionality
Implementation:
write_pca_metrics() function in core/output_manager.py (66 lines)core/acm_main.py after PCA fittingOutput Table: ACM_PCA_Metrics
Validation: Logs confirm "[OUTPUT] Cached pca_metrics.csv in artifact cache (11 rows)"
| Detector | Column | Status | Output Quality |
|---|---|---|---|
| AR1 | ar1_z | ✅ Running | Values vary correctly |
| PCA SPE | pca_spe_z | ✅ Running | Values vary (0.1-10) |
| PCA T² | pca_t2_z | ✅ Running | Values vary |
| Mahalanobis | mhal_z | ✅ Running | FIXED (was all 10.0) ✅ |
| IForest | iforest_z | ✅ Running | Values vary correctly |
| GMM | gmm_z | ✅ Running | Values vary correctly |
| CUSUM | cusum_z | ✅ Running | Values vary correctly |
| OMR | omr_z | ⚠️ Running | Suspicious (all ~10) 🔴 |
| Drift | drift_z | ❌ NULL | Not populated |
| HST | hst_z | ❌ NULL | Not populated |
| River HST | river_hst_z | ❌ NULL | Not populated |
Mahalanobis ALL 10.0 was caused by insufficient regularization.
Fix Applied:
core/correlation.py::MahalanobisDetector.fit()Validation: Log shows condition number reduced from 2.62e+13 to 2.63e+10 (success)
SELECT COUNT(*) FROM ACM_DetectorCorrelation WHERE EquipID=1
-- Result: 56 rows (detector-to-detector correlations)
core/correlation.py has functions for sensor correlation analysis that are never invoked in the pipeline.
SELECT COUNT(*) FROM ACM_SensorForecast_TS WHERE EquipID=1
-- Result: 360 rows (9 sensors × 40 timestamps)
-- drift_z column in ACM_Scores_Wide is NULL for all rows
SELECT drift_z FROM ACM_Scores_Wide WHERE EquipID=1
-- Result: NULL, NULL, NULL...
SELECT COUNT(*) FROM ACM_DriftSeries WHERE EquipID=1
-- Result: 194 rows (drift values exist!)
core/drift.py calculates drift and writes to ACM_DriftSeries, but the drift_z score is never populated in the main scores table.
SELECT * FROM ACM_CalibrationSummary WHERE EquipID=1
-- 16 rows (detector stats)
# Current (WRONG): Returns constant
failure_prob = some_static_calculation() # Always 0.4234
# Should be:
failure_prob = calculate_failure_probability(
forecast_health=forecast_series,
threshold=70,
horizon_hours=24
)
# Returns: 0.0 if health > 85, ~1.0 if health < 40
# Missing call in pipeline around line 3100-3200
if enable_rul and sql_client:
# After basic RUL calculation
# ADD THIS:
from core.enhanced_rul_estimator import EnhancedRULEstimator
enhanced_rul = EnhancedRULEstimator(cfg)
rul_result = enhanced_rul.estimate_rul(
health_forecast=health_forecast_df,
sensor_data=sensor_forecast_df,
regime_info=regime_info
)
output_manager.write_enhanced_rul(rul_result)
# Add new method
def write_pca_metrics(self, pca_detector, timestamp, runid, equipid):
"""Write PCA explained variance and component info."""
pca = pca_detector.pca
metrics = []
for i, var in enumerate(pca.explained_variance_ratio_):
metrics.append({
'RunID': runid,
'EquipID': equipid,
'Timestamp': timestamp,
'ComponentName': f'PC{i+1}',
'Value': var
})
self._bulk_insert('ACM_PCA_Metrics', metrics)
# After drift calculation (around line 2400)
if drift_result:
drift_score = drift_result['drift_score']
# ADD TO SCORES DICT:
scores['drift_z'] = drift_score # Currently missing!
Run these after fixes to verify:
-- 1. Failure probability should vary
SELECT
MIN(FailureProb) as MinProb,
MAX(FailureProb) as MaxProb,
AVG(FailureProb) as AvgProb,
STDEV(FailureProb) as StdProb
FROM ACM_FailureForecast_TS
WHERE EquipID=1;
-- Expected: Min < 0.1, Max > 0.8, Std > 0.15
-- 2. Mahalanobis should not saturate
SELECT
MIN(mhal_z) as MinMhal,
MAX(mhal_z) as MaxMhal,
AVG(mhal_z) as AvgMhal
FROM ACM_Scores_Wide
WHERE EquipID=1;
-- Expected: Min < 3, Max ~10, Avg 2-4
-- 3. PCA metrics should exist
SELECT COUNT(*) FROM ACM_PCA_Metrics WHERE EquipID=1;
-- Expected: > 0 rows (at least num_components)
-- 4. Drift z-score should be populated
SELECT COUNT(*)
FROM ACM_Scores_Wide
WHERE EquipID=1 AND drift_z IS NOT NULL;
-- Expected: Match row count of ACM_Scores_Wide